Code
library(tidyverse)
library(readr)
library(readxl)
library(cowplot)
library(viridis)
library(patchwork)
theme_set(theme_cowplot(14))The partition of the existing bacteria into biological units have been debated over time, leaving us with a working framework where bacterial species can be divided based based on their genetic content. However, bacterial species are non-stationary groups that are in constant evolution, dynamically gaining and losing genetic material while keeping a core of functions unchanged even across large geographical distances. Moreover, the functional content of a given pair of strains can largely differ due to their evolutionary history, meaning that they can have distinct ways to operate with their environment. Besides, the functional characterization of these strains remains an open challenge due to their, sometimes, extremely large diversity and lack of experimental validation. This project is meant to explore the functional landscape of the Escharichia coli pangenome, one of the most sequenced bacterial species at this moment.
library(tidyverse)
library(readr)
library(readxl)
library(cowplot)
library(viridis)
library(patchwork)
theme_set(theme_cowplot(14))Strain information has been downloaded from NCBI genomes. The metadata was downloaded on January 11th 2024. To reduce the number of E. coli genomes, the assembly level allowed was only for the types of “chromosome”, “complete genome” or “scaffold”. We can have a look at parameters such as the contig_n50/scaffold_n50 to see how the distribution of contigs looks like in the full cohort. We can see two different peaks, one near 0 that represents all the assemblies that have multiple contigs, and another around 5 megabases
metadata = read_xlsx('ecoli_ncbi_metadata.xlsx')
metadata = metadata %>%
distinct(assembly_name, .keep_all = T)
metadata = metadata %>%
# remove the string after the dot in assembly_id
mutate(assembly_id_simp = str_remove(assembly_id, "\\..*"),
.before = "assembly_name")
metadatametadata %>%
ggplot(aes(contig_n50)) +
geom_histogram(binwidth = 10000) +
scale_x_continuous(labels = scales::comma) +
labs(x = "Contig N50", y = "Number of genomes") +
theme_cowplot(14) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
# plot distribution of scaffold_n50
metadata %>%
ggplot(aes(scaffold_n50)) +
geom_histogram(binwidth = 10000) +
scale_x_continuous(labels = scales::comma) +
labs(x = "Scaffold N50", y = "Number of genomes") +
theme_cowplot(14) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
It is even more evident that there are huge differences in terms of the scaffold N50 parameter in respect to their assembly level. As expected, we have a greater count of scaffold at chromosome or complete genome level compared to scaffold. The assemblies from the first two categories will be included, and now we need to include a threshold for the last category to remove strains with low scaffold N50.
# plot scaffold distribution by assembly level with violin plot
metadata %>%
ggplot(aes(assembly_level, scaffold_n50, fill = assembly_level)) +
geom_violin(show.legend = F) +
# plot mean and sd as pointrange
stat_summary(fun.data = mean_sdl, geom = "pointrange",
fun.args = list(mult = 1), color = "black",
show.legend = F) +
scale_y_continuous(labels = scales::comma) +
labs(x = "Assembly level", y = "Scaffold N50") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_viridis(discrete = T, begin = 0.2, end = 0.8) +
theme_cowplot(14)
When we filter out the complete categories and then show the distribution of the scaffold N50, we have this distribution:
# show scaffold distribution of assembly_level == scaffold
p1 = metadata %>%
filter(assembly_level == 'Scaffold') %>%
ggplot(aes(scaffold_n50)) +
geom_histogram(binwidth = 10000) +
scale_x_continuous(labels = scales::comma) +
labs(x = "Scaffold N50", y = "Number of genomes") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme_cowplot(14)
p2 = metadata %>%
filter(assembly_level == 'Scaffold') %>%
filter(scaffold_n50 < 500000) %>%
ggplot(aes(scaffold_n50)) +
geom_histogram(binwidth = 10000) +
scale_x_continuous(labels = scales::comma) +
labs(x = "Scaffold N50", y = "Number of genomes") +
theme_cowplot(14) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# show them with patchwork
p1 + p2
metadata_filt = metadata %>% filter(scaffold_n50 > 100000)
dims = dim(metadata_filt)
dims[1][1] 11860
From the distribution of the scaffold N50 values, we can safely chose a threshold of ~ 150K as that will leave out E. coli assemblies that have a high amount of contigs/scaffolds.
In summary: we are choosing a threshold of 150K for the scaffold_n50 parameter from now on. Anything below that will be discarded.
After this filtering step we have 11860 genomes left.
We need to filter the bad quality genomes from the cohort, which we can start doing by filtering out by checkM metricsL:
checkM completeness: the value that is usually applied ranges from 70%, but I’ll be using a more strict value of 95% because I want to have full genomes when possible.
checkM contamination: I am applying a value of 1% max contamination to avoid getting false information regarding the gene content.
metadata_filt %>%
mutate(checkM_completeness_class =
case_when(is.na(checkM_completeness) ~ 'NA',
checkM_completeness < 95 ~ "< 95%",
checkM_completeness >= 95 ~ ">= 95%")) %>%
count(checkM_completeness_class) %>% print# A tibble: 2 × 2
checkM_completeness_class n
<chr> <int>
1 < 95% 122
2 >= 95% 11738
metadata_filt %>%
mutate(checkM_contamination_class =
case_when(is.na(checkM_contamination) ~ 'NA',
checkM_contamination > 1 ~ "> 1%",
checkM_contamination <= 1 ~ "<= 1%")) %>%
count(checkM_contamination_class) %>% print# A tibble: 2 × 2
checkM_contamination_class n
<chr> <int>
1 <= 1% 9421
2 > 1% 2439
metadata_filt = metadata_filt %>%
filter(checkM_completeness > 95,
checkM_contamination < 1) %>%
filter(seq_length < 7000000)Finally, after calculating the number of contigs per assembly, I have seen that some of them have a high number of contigs, which is not a good sign for the quality of the assembly. I have decided to filter out genomes with more than 300 contigs.
metadata_filt = read_csv("tables/ecoli_ncbi_metadata_filt.csv")
p1 = metadata_filt %>%
# filter(Contig_number < 500) %>%
ggplot(aes(Contig_number)) +
geom_histogram(binwidth = 1) +
scale_x_continuous(labels = scales::comma) +
labs(x = "Number of contigs", y = "Number of genomes") +
theme_cowplot(14)
p2 = metadata_filt %>%
filter(Contig_number < 300) %>%
ggplot(aes(Contig_number)) +
geom_histogram(binwidth = 1) +
scale_x_continuous(labels = scales::comma) +
labs(x = "Number of contigs", y = "Number of genomes") +
theme_cowplot(14)
p1 + p2
metadata_filt = metadata_filt %>%
filter(Contig_number < 300)Another filter that I have applied is using the phylogroup characterization of the strains with the tool EzClermont1. As we can see in the next plot with the phylogroup distributions, we have some that failed or that belong to cryptic groups. I have discarded these.
phylogroups = read_tsv("tables/phylogroups_ezclermont.tsv",
col_names = F) %>%
rename(genome_id = `X1`,
phylogroup = `X2`) %>%
mutate(assembly_id = str_sub(genome_id, 1, 15))
phylogroups %>%
ggplot(aes(phylogroup, fill = phylogroup)) +
geom_bar(show.legend = F) +
labs(x = "Phylogroup", y = "Number of genomes") +
theme_cowplot(14) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
library(tidyverse)
metadata_final = read_csv("tables/MAIN_ecoli_NCBI_metadata.csv")
dim_filt = metadata_final %>% dimSummarising all the filtering steps I have performed:
filtered out genomes with a scaffold N50 lower than 150K.
filtered out genomes with a checkM completeness lower than 95%.
filtered out genomes with a checkM contamination higher than 1%.
filtered out genomes with a sequence length greater than 7Mbp.
filtered out genomes with a contig number higher than 300.
filtered out genomes that had a higher mash difference than 0.05.
filtered out genomes that were not classified as one of the main E. coli phylogroups
In addition to the strains from the NCBI database, I have included the strains we are using in our lab. These strains are not part of the NCBI database, so they are not included in the filtering steps. In total we are adding 729 strains that belong also to the normal phylogroups as before.
In summary, if we apply all filters, we are left with 9566 genomes.
The distribution of the different phylogroups represented in the E. coli cohort follow a similar trend as in other research publications2. The phylogroup B1 is the most represented, followed by phylogroup A and B2. The rest of the phylogroups are less represented, with phylogroup U being the least represented.
phylo_colors = c("A" = "#1b9e77",
"B1" = "#d95f02",
"B2" = "#7570b3",
"D" = "#e7298a",
"E" = "#66a61e",
"F" = "#e6ab02",
"G" = "#a6761d",
"C" = "#3A254D",
"U" = "#666666")
metadata_final %>%
ggplot(aes(phylogroup, fill = phylogroup)) +
geom_bar(show.legend = F) +
labs(x = "Phylogroup", y = "Number of genomes") +
scale_fill_manual(values = phylo_colors) +
theme_cowplot(14) 
The E. coli strains were selected from the NCBI genome database. The metadata was downloaded on January 11th 2024. The strains were filtered based on the following criteria: genomes that were not at the assembly level of “chromosome”, “complete genome” or “scaffold” were removed. The scaffold N50 was used to filter out genomes with a value lower than 150K. The checkM metrics were used to filter out genomes with a completeness lower than 95% and a contamination higher than 1%. Mash distances were calculated pairwise between all genomes, and those distances that were higher than 0.05 were removed. Finally, genomes with a sequence length greater than 7Mbp, and/or genomes with a contig count higher than 300 were removed. Genomes were dowloaded with the NCBI dataset download tool (v. 16.2.0). The final number of genomes was 9566.
The gene annotation was performed with Bakta (v. 1.9.3) using the full database (v. 5.1) and by default parameters. E. coli phylogroups were determined with EzClermont (v. 0.7.0)1 based on the approach from CermonTyping3. Genomes that did not classify into the main E. coli phylogroups or that were cryptic were discarded for following analyses.